1 Data management

We load the speeches made by legislators in the UK House of Commons during the 2014 calendar year, which we will work with throughout this seminar. Using structural topic modelling, we aim to understand how parliamentary speech varies with the gender of the MP.

# We load the data that will be used
PATH <- '~/Desktop/TERM 1/SMI/seminar5/data'
load(file.path(PATH, "hoc_speeches.RData"))

1.1 Summary statistics

Now, we can compute some summary statistics to get a quick overview of the general patterns in our data.

summary(speeches)
      date                name              party              gender        speech              ntoken      
 Min.   :2014-01-06   Length:7590        Length:7590        female:1670   Length:7590        Min.   :   1.0  
 1st Qu.:2014-03-12   Class :character   Class :character   male  :5920   Class :character   1st Qu.:  50.0  
 Median :2014-06-23   Mode  :character   Mode  :character                 Mode  :character   Median :  80.0  
 Mean   :2014-06-24                                                                          Mean   : 171.6  
 3rd Qu.:2014-10-21                                                                          3rd Qu.: 132.0  
 Max.   :2014-12-18                                                                          Max.   :6342.0  

We can see that there is a significant gender-wise class imbalance: female speeches make up just 22% of the 7,590 speeches in the dataset, while male speeches account for the remaining 78%.

Then, we can check for duplicated speeches in the data and see that there are, in fact, some: 7,139 unique speeches remain out of the original 7,590.

length(unique(speeches$speech))
[1] 7139

However, since a duplicated speech may have been delivered by different members (and therefore possibly by both men and women), we will not drop duplicates, to make sure we do not lose information by further imbalancing the dataset.
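The risk can be seen on a toy example (the mini data frame below is purely illustrative, not part of the seminar data): when the same text appears under speakers of different genders, blindly dropping duplicates would discard one gender's copy.

```r
# Hypothetical mini-dataset illustrating the duplicate check
toy <- data.frame(
  name   = c("A", "B", "C"),
  gender = c("female", "male", "male"),
  speech = c("I beg to move.", "I beg to move.", "Order, order.")
)
length(unique(toy$speech))  # 2 unique texts out of 3 rows
# Rows involved in any duplication, broken down by gender:
table(toy$gender[duplicated(toy$speech) | duplicated(toy$speech, fromLast = TRUE)])
```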

Finally, we can check the number of speeches performed by each party in the UK House of Commons:

speeches %>% 
  group_by(party) %>%
  summarise(observations = n())

where we can see that 56% were delivered by the Conservative party, 34% by the Labour party and 10% by the Liberal Democrats, which tells us which parties have more representatives in the House of Commons (and hence more decision power).
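The quoted percentages can be computed directly from the counts; a minimal sketch, assuming the speeches data frame and the tidyverse loaded as above:

```r
# Share of speeches per party, in percent
speeches %>%
  count(party) %>%
  mutate(share = round(100 * n / sum(n), 1))
```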

1.2 Creating and analyzing the corpus data

Next, we will use the functions in the quanteda package to turn this data into a corpus object. We will store the speeches data through the meta parameter of the corpus() function, so we can access it later. Furthermore, we can also check the detected language of each speech; as expected, they are all in English.

speeches_corpus <- corpus(speeches, text_field = 'speech', meta = speeches)
languages <- detect_language(speeches_corpus)
table(languages)
languages
  en 
7590 

To study the corpus more thoroughly, we can plot the length of each document with the ntoken() function. Here, on the original uncleaned data, tokens include words, punctuation characters, numbers, and so on. Let's plot the distribution and analyze it further.

ntokens_speeches <- ntoken(speeches_corpus)
data.frame(ntokens_speeches) %>% ggplot(aes(ntokens_speeches)) + geom_histogram() + xlab('Number of tokens')

First, we can see that most speeches have fewer than roughly 250 tokens (and a lot of them seem to have near-zero length). At the other extreme, only a handful of outliers reach several thousand tokens (the maximum is 6,342). Therefore, all things considered, we will filter the speeches and keep only those with between 50 and 1,000 tokens.
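Before fixing the 50–1,000 cutoffs, it is worth checking how much data they keep; a quick sketch using the ntokens_speeches vector computed above:

```r
# Tails of the token-count distribution, and share of speeches retained
quantile(ntokens_speeches, probs = c(0.05, 0.5, 0.95, 0.99))
mean(ntokens_speeches >= 50 & ntokens_speeches <= 1000)
```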

docs_length50_1000 <- names(ntokens_speeches[ntokens_speeches >= 50 & ntokens_speeches <= 1000])
speeches_corpus <- speeches_corpus[names(speeches_corpus) %in% docs_length50_1000]

We can now compute summary statistics on the filtered data and see the main features of each speech. We show only the first 10 documents, so that the output is not unnecessarily long.

summary(speeches_corpus, 10)
Corpus consisting of 6057 documents, showing 10 documents:

   Text Types Tokens Sentences       date            name        party gender ntoken
  text1    78    109         5 2014-03-13       Lyn Brown       Labour female    109
  text2    66     97         8 2014-09-09   Cheryl Gillan Conservative female     97
  text3    41     59         3 2014-04-08    James Morris Conservative   male     59
  text4    52     68         3 2014-12-08  Rehman Chishti Conservative   male     68
  text6   277    630        25 2014-01-09   Anne McIntosh Conservative female    642
  text7    48     69         4 2014-10-13 Debbie Abrahams       Labour female     71
  text8    38     54         2 2014-09-08   Brandon Lewis Conservative   male     54
  text9    81    113         4 2014-01-06    Andy Sawford       Labour   male    113
 text10   268    626        21 2014-03-05   Jeremy Lefroy Conservative   male    634
 text11    59     72         5 2014-01-16  Andrew Lansley Conservative   male     72

1.3 Cleaning the data

Now, with the filtered data, we will first turn the corpus into a tokens object and then into a document-feature matrix, applying all the preprocessing steps to our dataset. More particularly, we will use the following data-cleaning steps:

  • Punctuation, symbol and number removal.
  • Stopword removal: we drop very common, uninformative words. To select them, we first take the most common English words (with the function stopwords("en")) and then keep adding words to the list as we inspect the most frequent remaining terms.
  • Stemming: we keep only the generic stem shared by several inflected forms of a word, in order to reduce dimensionality.
  • Infrequent-term removal: we get rid of words that rarely appear.
  • n-grams: in addition to each single word, we also keep each pair of adjacent words; that is, n-grams with n = 1 and n = 2.
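The n-gram step can be illustrated on a toy sentence (the text below is illustrative, not from the dataset): with n = c(1, 2), quanteda keeps every single token plus every pair of adjacent tokens, joined by its default "_" concatenator.

```r
library(quanteda)

# Toy illustration of tokens_ngrams() with unigrams and bigrams together
toks <- tokens("local government must support local people")
as.list(tokens_ngrams(toks, n = 1:2))
```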

Then, we can display a small portion of the resulting cleaned data, showing how many times each of the displayed words appears in the first few texts:

eng_sw <- stopwords("en")
our_stopwords <- c(eng_sw, 'hon', 'can','s','make','need','one','want','say','take','sure','however',
                   'may','much','must','now','go','use','two','get','said','made','also','look','put','see','give')
our_stopwords
  [1] "i"          "me"         "my"         "myself"     "we"         "our"        "ours"       "ourselves"  "you"       
 [10] "your"       "yours"      "yourself"   "yourselves" "he"         "him"        "his"        "himself"    "she"       
 [19] "her"        "hers"       "herself"    "it"         "its"        "itself"     "they"       "them"       "their"     
 [28] "theirs"     "themselves" "what"       "which"      "who"        "whom"       "this"       "that"       "these"     
 [37] "those"      "am"         "is"         "are"        "was"        "were"       "be"         "been"       "being"     
 [46] "have"       "has"        "had"        "having"     "do"         "does"       "did"        "doing"      "would"     
 [55] "should"     "could"      "ought"      "i'm"        "you're"     "he's"       "she's"      "it's"       "we're"     
 [64] "they're"    "i've"       "you've"     "we've"      "they've"    "i'd"        "you'd"      "he'd"       "she'd"     
 [73] "we'd"       "they'd"     "i'll"       "you'll"     "he'll"      "she'll"     "we'll"      "they'll"    "isn't"     
 [82] "aren't"     "wasn't"     "weren't"    "hasn't"     "haven't"    "hadn't"     "doesn't"    "don't"      "didn't"    
 [91] "won't"      "wouldn't"   "shan't"     "shouldn't"  "can't"      "cannot"     "couldn't"   "mustn't"    "let's"     
[100] "that's"     "who's"      "what's"     "here's"     "there's"    "when's"     "where's"    "why's"      "how's"     
[109] "a"          "an"         "the"        "and"        "but"        "if"         "or"         "because"    "as"        
[118] "until"      "while"      "of"         "at"         "by"         "for"        "with"       "about"      "against"   
[127] "between"    "into"       "through"    "during"     "before"     "after"      "above"      "below"      "to"        
[136] "from"       "up"         "down"       "in"         "out"        "on"         "off"        "over"       "under"     
[145] "again"      "further"    "then"       "once"       "here"       "there"      "when"       "where"      "why"       
[154] "how"        "all"        "any"        "both"       "each"       "few"        "more"       "most"       "other"     
[163] "some"       "such"       "no"         "nor"        "not"        "only"       "own"        "same"       "so"        
[172] "than"       "too"        "very"       "will"       "hon"        "can"        "s"          "make"       "need"      
[181] "one"        "want"       "say"        "take"       "sure"       "however"    "may"        "much"       "must"      
[190] "now"        "go"         "use"        "two"        "get"        "said"       "made"       "also"       "look"      
[199] "put"        "see"        "give"      
dfm_speeches <- tokens(speeches_corpus, remove_punct = TRUE, 
                                          remove_symbols = TRUE, remove_numbers = TRUE) %>% 
                            tokens_remove(our_stopwords) %>%
                            tokens_wordstem() %>%
                            tokens_ngrams(n = c(1,2)) %>% 
                            dfm() %>% 
                            dfm_tolower() %>% 
                            dfm_trim(min_termfreq = 5, min_docfreq = 0.0025, docfreq_type = "prop")
dfm_speeches
Document-feature matrix of: 6,057 documents, 3,267 features (98.45% sparse) and 5 docvars.
       features
docs    welcom friend statement report make clear procur link local govern
  text1      2      1         1      1    1     1      2    1     4      2
  text2      1      1         0      0    2     0      0    0     0      0
  text3      0      0         0      0    0     0      0    0     4      4
  text4      1      0         0      0    0     0      0    1     0      1
  text6      0      3         0      1    0     0      0    0     4      5
  text7      0      0         0      0    0     0      0    0     0      0
[ reached max_ndoc ... 6,051 more documents, reached max_nfeat ... 3,257 more features ]

Now, we can plot the length of each document again, this time for the filtered, cleaned and preprocessed data, and see the effect of our changes:

ntokens_speeches_2 <- ntoken(dfm_speeches)
data.frame(ntokens_speeches_2) %>% ggplot(aes(ntokens_speeches_2)) + geom_histogram() + xlab('Number of tokens')

To check whether other meaningless words still appear with high frequency in the documents, we can show the top \(n\) most frequent words: by brute force, we can keep appending such words to the stopword list and re-running the code, until the set of most frequent words contains only terms we consider meaningful.

topfeatures(dfm_speeches, 50)
   govern     peopl    member     right    friend      work      hous    minist      year   support      time      bill 
     4064      3672      3307      3001      2711      2450      2354      2257      1999      1714      1674      1603 
    state     point      issu   countri    import     debat secretari      mani gentleman     think       way  constitu 
     1557      1522      1518      1512      1481      1428      1427      1388      1359      1315      1292      1265 
      new      know        us     local     place     chang      come      agre      busi    servic     clear     ensur 
     1236      1219      1215      1144      1134      1095      1092      1084      1069      1051      1026      1021 
     case  committe    number    labour      just      part      last      home       tax      help     power    provid 
      987       966       965       954       945       940       933       892       879       870       866       866 
 question    public 
      856       836 
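For instance, if after inspection we judged some of the stems above to be uninformative as well, a further iteration would simply append them and re-run the pipeline (the words chosen below are illustrative, picked from the frequency table above):

```r
# Hypothetical next iteration of the brute-force stopword search
our_stopwords <- c(our_stopwords, 'think', 'know', 'come', 'just')
# ...then rebuild dfm_speeches and call topfeatures() again
```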

1.4 Word cloud

Once we have fully cleaned our data, with all the chosen stopwords removed, we can plot the resulting word cloud with the textplot_wordcloud() function and see what we obtain.

# Note: the Set2 palette provides at most 8 colors, so we request 8
textplot_wordcloud(dfm_speeches, random_order = FALSE, rotation = 0.25, 
    color = RColorBrewer::brewer.pal(8, "Set2"), max_words = 100, max_size = 3)

1.4.1 Bigram word cloud

Now, we will additionally plot a word cloud for bigrams only, in order to see the most frequent word pairs. Hence, we first create the document-feature matrix for bigrams, with the same preprocessing steps applied to our dataset.

bigrams <- tokens(speeches_corpus, remove_punct = TRUE, 
                                          remove_symbols = TRUE, remove_numbers = TRUE) %>% 
                            tokens_remove(our_stopwords) %>%
                            tokens_wordstem() %>%
                            tokens_ngrams(n = c(2)) %>% 
                            dfm() %>% 
                            dfm_tolower() %>% 
                            dfm_trim(min_termfreq = 5, min_docfreq = 0.0025, docfreq_type = "prop")
bigrams
Document-feature matrix of: 6,057 documents, 844 features (99.42% sparse) and 5 docvars.
       features
docs    make_clear local_govern valu_money increas_number govern_support support_work work_done local_peopl friend_make
  text1          1            1          1              1              1            1         1           1           0
  text2          0            0          0              0              0            0         0           0           1
  text3          0            3          0              0              0            0         0           0           0
  text4          0            0          0              0              0            0         0           0           0
  text6          0            2          0              0              0            0         0           0           0
  text7          0            0          0              0              0            0         0           0           0
       features
docs    valid_point
  text1           0
  text2           1
  text3           0
  text4           0
  text6           0
  text7           0
[ reached max_ndoc ... 6,051 more documents, reached max_nfeat ... 834 more features ]

Now, we can plot the number of bigrams in each document and see how it compares with the distribution over all tokens:

bigram_ntokens <- ntoken(bigrams)
data.frame(bigram_ntokens) %>% ggplot(aes(bigram_ntokens)) + geom_histogram() + xlab('Number of bigrams')

Finally, we plot the bigram word cloud, where we have reduced the maximum number of bigrams to plot, since the chart above shows that the x-axis (number of bigrams) now only extends to about 40.

# Note: the Set2 palette provides at most 8 colors, so we request 8
textplot_wordcloud(bigrams, random_order = FALSE, rotation = 0.25, 
    color = RColorBrewer::brewer.pal(8, "Set2"), max_words = 40, max_size = 3)

2 Structural Topic Model (STM)

2.1 Defining the STM

Next, we will run a structural topic model on this corpus, using the gender variable as the topic-prevalence argument and setting \(K=20\) as the number of topics. The stm() function that fits this topic model will return \(K\) topics, each containing the words most strongly associated with it, as we can see in the graph below:

stm_speeches <- stm(documents=dfm_speeches, prevalence = ~gender, 
                    K = 20, seed=123)
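Formally, the prevalence argument ties each document's topic proportions to its covariates: writing \(x_d\) for the covariate vector of document \(d\) (here, its gender indicator), the STM draws the topic proportions \(\theta_d\) from a logistic-normal distribution whose mean depends on the covariates,

\[ \theta_d \sim \mathrm{LogisticNormal}\left(x_d^{\top}\gamma,\ \Sigma\right), \]

so the estimated coefficients \(\gamma\) capture how expected topic prevalence shifts between male and female MPs.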

2.2 Plotting the results

plot(x=stm_speeches, type = 'summary', n=5)

To further analyze each topic and the words it contains, we can show the four different word weightings used to assign words to each topic: highest probability (the most probable words), FREX (words that are both frequent and exclusive), and lift and score (weightings from two other popular text-mining packages).

labelTopics(stm_speeches)
Topic 1 Top Words:
     Highest Prob: bill, amend, claus, govern, new, hous, committe 
     FREX: amend, claus, new_claus, lord, bill, provis, tabl 
     Lift: lord_amend, govern_amend, subsect, amend, new_claus, tabl_amend, claus_amend 
     Score: lord_amend, amend, bill, claus, new_claus, amendment, lord 
Topic 2 Top Words:
     Highest Prob: legisl, data, govern, power, concern, act, protect 
     FREX: data, cycl, safeti, communic, legisl, mail, enforc 
     Lift: call_govern, communic_data, royal_mail, intercept, mail, cycl, gambl 
     Score: call_govern, legisl, data, cycl, mail, royal_mail, communic 
Topic 3 Top Words:
     Highest Prob: year, last, week, month, past, ago, three 
     FREX: ago, cancer, year_ago, last_week, four, women, church 
     Lift: three_four, first_world, church, four_half, cancer, commemor, three_year 
     Score: three_four, year, cancer, women, last, ago, year_ago 
Topic 4 Top Words:
     Highest Prob: question, minist, gentleman, ask, hous, whether, way 
     FREX: question, answer, ask, ladi, statement, order, written 
     Lift: explan, grate_minist, give_way, ask_question, soon_possibl, grate_ladi, urgent_question 
     Score: explan, question, answer, point_order, gentleman, ask, minist 
Topic 5 Top Words:
     Highest Prob: local, care, author, servic, fund, council, communiti 
     FREX: local, local_author, rural, author, council, fund, care 
     Lift: support_local, depart_environ, local_communiti, care_system, environ_food, food_rural, rural_affair 
     Score: local, support_local, local_author, author, rural, health, council 
Topic 6 Top Words:
     Highest Prob: constitu, invest, london, friend, plan, citi, new 
     FREX: rail, road, london, citi, infrastructur, invest, transport 
     Lift: factori, main_line, railway_line, high-spe, railway, high_speed, south-west 
     Score: factori, invest, rail, citi, london, railway, station 
Topic 7 Top Words:
     Highest Prob: member, debat, friend_member, friend, hous, mr, right 
     FREX: friend_member, member, right_member, member_parliament, mr, member_north, north 
     Lift: speech_member, member_gainsborough, edward_leigh, gainsborough_sir, sir_edward, north_mr, member_north 
     Score: member, speech_member, friend_member, mr, debat, north, right_member 
Topic 8 Top Words:
     Highest Prob: work, peopl, job, young, employ, young_peopl, wage 
     FREX: minimum_wage, minimum, wage, young_peopl, apprenticeship, young, employ 
     Lift: work_experi, low_pay, work_programm, nation_minimum, minimum_wage, support_work, jobseek_allow 
     Score: work_experi, wage, young_peopl, young, work, employ, apprenticeship 
Topic 9 Top Words:
     Highest Prob: busi, small, compani, bank, govern, insur, small_busi 
     FREX: small_busi, busi, insur, pub, small, busi_rate, corpor 
     Lift: small_busi, busi_rate, medium-s, small_medium-s, high_street, depart_busi, busi_innovat 
     Score: busi_rate, busi, small_busi, small, compani, bank, pub 
Topic 10 Top Words:
     Highest Prob: govern, labour, sinc, food, poverti, year, polici 
     FREX: poverti, labour_govern, credit, food, labour, food_bank, child_care 
     Lift: child_poverti, child_care, food_bank, poverti, univers_credit, live_crisi, inequ 
     Score: child_poverti, labour, poverti, food_bank, child_care, food, univers_credit 
Topic 11 Top Words:
     Highest Prob: school, children, state, secretari_state, secretari, educ, teacher 
     FREX: school, teacher, children, educ, free_school, educat, parent 
     Lift: head_teacher, meal, primari_school, free_school, teacher, pupil, school 
     Score: meal, school, teacher, children, free_school, educ, parent 
Topic 12 Top Words:
     Highest Prob: court, justic, right, prison, law, crimin, european 
     FREX: prison, justic, court, sentenc, arrest_warrant, offend, ukrain 
     Lift: legal_aid, arrest_warrant, european_arrest, magistr, judici_review, justic_system, offend 
     Score: legal_aid, court, justic, prison, russia, ukrain, crimin 
Topic 13 Top Words:
     Highest Prob: tax, money, govern, budget, spend, rate, million 
     FREX: tax, budget, spend, incom_tax, money, incom, borrow 
     Lift: council_tax, 50p, exchequer, incom_tax, annuiti, chancellor_exchequer, help_buy 
     Score: council_tax, tax, incom_tax, budget, incom, rate, money 
Topic 14 Top Words:
     Highest Prob: hous, peopl, benefit, home, sector, rent, privat 
     FREX: rent, landlord, disabl, mental, tenant, mental_health, privat_rent 
     Lift: disabl_live, live_allow, privat_rent, rent_sector, capabl_assess, work_capabl, landlord 
     Score: disabl_live, rent, disabl, tenant, landlord, privat_rent, rent_sector 
Topic 15 Top Words:
     Highest Prob: parti, wale, vote, elect, scotland, power, northern 
     FREX: scotland, ireland, northern_ireland, scottish, wale, northern, devolut 
     Lift: english_vote, peopl_scotland, vote_english, scottish_govern, scottish_referendum, english_law, scottish 
     Score: vote_english, scotland, northern_ireland, wale, devolut, ireland, scottish 
Topic 16 Top Words:
     Highest Prob: home, polic, forc, secretari, home_secretari, individu, crime 
     FREX: polic, home_secretari, passport, arm, investig, forc, arm_forc 
     Lift: polic_forc, rape, metropolitan_polic, polic_servic, passport, shadow_home, polic_offic 
     Score: polic_forc, polic, home_secretari, crime, passport, arm_forc, terror 
Topic 17 Top Words:
     Highest Prob: countri, minist, govern, state, right, prime, prime_minist 
     FREX: palestinian, israel, iraq, israeli, peac, european_union, isil 
     Lift: iran, nick, geneva, sunni, hama, assad, diplomat 
     Score: nick, palestinian, iraq, isil, israel, syria, israeli 
Topic 18 Top Words:
     Highest Prob: peopl, know, minist, gentleman, think, nhs, right 
     FREX: speaker, deputi, nhs, deputi_speaker, hospit, trust, mr_speaker 
     Lift: shout, madam, madam_deputi, health_secretari, deputi_speaker, deputi_prime, deputi 
     Score: shout, nhs, speaker, hospit, patient, deputi, prime_minist 
Topic 19 Top Words:
     Highest Prob: friend, right, point, committe, issu, rais, import 
     FREX: select_committe, select, import_point, rais, friend_make, clerk, rais_issu 
     Lift: philip_davi, member_shipley, shipley, philip, shipley_philip, right_rais, gentleman_rais 
     Score: philip_davi, committe, friend, select_committe, select, rais, right_friend 
Topic 20 Top Words:
     Highest Prob: energi, market, price, compani, consum, govern, right 
     FREX: energi, climat, climat_chang, price, gas, market, consum 
     Lift: ofgem, big_six, energi_price, energi_compani, climat_chang, carbon, electr 
     Score: big_six, energi, consum, price, market, gas, compani 

2.3 Top 3 documents per topic

To find the top three documents associated with each topic, we will use the plotQuote() function, which produces an image with the top 3 original documents most strongly related to each topic. We could also use the findThoughts() function (which we have kept in the code), but it displays the top 3 documents for all topics together, which makes for a cluttered display. Hence, we use plotQuote() directly and display the results topic by topic, as below:

top3_docs <- findThoughts(model = stm_speeches, texts = as.character(speeches_corpus), n = 3)
for (i in 1:20) {
  plotQuote(as.character(speeches_corpus)[order(stm_speeches$theta[, i], decreasing = TRUE)[1:3]],
            width = 150, main = paste('topic:', i))
}

As we can see, if we check each topic and its most representative words, we find that the texts correspond well to their associated topics.

2.4 Topic comparison between men and women

Finally, we can check on which topics women are, on average, more active than men. To do that, we will plot the results provided by estimateEffect(). Among the many parameters we can set, the most important one is the plotting method; we will use "difference", since it clearly displays the comparison between men and women for each topic. Setting the first covariate value to 'male' and the second to 'female', and taking 0 as the reference point, each topic is shifted to the right if men are more active than women on it, or to the left if women are more active than men. We can verify this criterion against the sign of the estimate in the summary of the estimateEffect() output: since it reports the effect for men (gendermale), a topic shifted to the right has a positive estimate, and one shifted to the left has a negative estimate. Let's show the results, so we can verify them.

gender_effect <- estimateEffect(formula = ~gender, stmobj = stm_speeches,metadata = 
                                  docvars(dfm_speeches))
summary(gender_effect)

Call:
estimateEffect(formula = ~gender, stmobj = stm_speeches, metadata = docvars(dfm_speeches))


Topic 1:

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  0.0476095  0.0040813  11.665   <2e-16 ***
gendermale  -0.0001506  0.0045783  -0.033    0.974    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1


Topic 2:

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 0.035701   0.003289  10.855   <2e-16 ***
gendermale  0.003040   0.003825   0.795    0.427    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1


Topic 3:

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 0.0454110  0.0027941   16.25   <2e-16 ***
gendermale  0.0007707  0.0032169    0.24    0.811    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1


Topic 4:

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  0.0744100  0.0038080  19.540   <2e-16 ***
gendermale  -0.0005825  0.0042270  -0.138     0.89    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1


Topic 5:

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 0.052432   0.003586   14.62   <2e-16 ***
gendermale  0.004711   0.004061    1.16    0.246    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1


Topic 6:

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 0.053743   0.003928  13.683   <2e-16 ***
gendermale  0.002761   0.004392   0.629     0.53    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1


Topic 7:

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 0.060877   0.003686  16.515   <2e-16 ***
gendermale  0.002595   0.004460   0.582    0.561    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1


Topic 8:

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  0.050769   0.003659  13.876   <2e-16 ***
gendermale  -0.003078   0.004100  -0.751    0.453    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1


Topic 9:

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  0.039757   0.003030  13.123   <2e-16 ***
gendermale  -0.006531   0.003432  -1.903    0.057 .  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1


Topic 10:

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  0.0484900  0.0037022  13.098   <2e-16 ***
gendermale  -0.0007185  0.0044012  -0.163     0.87    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1


Topic 11:

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  0.037905   0.003634   10.43   <2e-16 ***
gendermale  -0.001265   0.004087   -0.31    0.757    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1


Topic 12:

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  0.041880   0.003492  11.991   <2e-16 ***
gendermale  -0.002189   0.003890  -0.563    0.574    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1


Topic 13:

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 0.039330   0.003202  12.284   <2e-16 ***
gendermale  0.003013   0.003785   0.796    0.426    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1


Topic 14:

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  0.0412647  0.0034208   12.06   <2e-16 ***
gendermale  -0.0001557  0.0038750   -0.04    0.968    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1


Topic 15:

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 0.049990   0.004193  11.921   <2e-16 ***
gendermale  0.001092   0.004531   0.241     0.81    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1


Topic 16:

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 0.054442   0.003772  14.434   <2e-16 ***
gendermale  0.001196   0.004416   0.271    0.787    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1


Topic 17:

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  0.0697661  0.0045979  15.173   <2e-16 ***
gendermale  -0.0005429  0.0054805  -0.099    0.921    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1


Topic 18:

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  0.066942   0.003894   17.19   <2e-16 ***
gendermale  -0.006633   0.004364   -1.52    0.129    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1


Topic 19:

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  0.062449   0.003398  18.379   <2e-16 ***
gendermale  -0.002024   0.003854  -0.525    0.599    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1


Topic 20:

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 0.026940   0.002947   9.142   <2e-16 ***
gendermale  0.004609   0.003229   1.427    0.154    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
plot.estimateEffect(x = gender_effect, covariate = 'gender', model = stm_speeches, method = 'difference',
                    cov.value1 = 'male', cov.value2 = 'female', labeltype = 'custom',
                    custom.labels = paste0(1:20, ': M vs W'),
                    xlab = 'Proportion', ylab = 'Topic', main = 'Proportion of topics per gender',
                    n = 3, width = 100)
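As a sanity check, the plotted difference for each topic should coincide with the corresponding gendermale coefficient from the regression tables above; a sketch, assuming the gender_effect object from the previous chunk (summary.estimateEffect exposes the per-topic coefficient matrices through its tables element):

```r
# Extract the gendermale estimate for topic 1 from the summary tables;
# positive: topic shifted towards men, negative: towards women
summary(gender_effect)$tables[[1]]["gendermale", "Estimate"]
```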

As we can see, for the topics in which women are more active than men, the estimates are indeed negative, while the topics in which men are more active than women have positive estimates.

---
title: "Statistical Modelling and Inference - Seminar 5"
author: "Sergio-Yersi Villegas Pelegrín"
output:
  html_notebook:
    code_folding: hide
    toc: yes
    toc_depth: 3
    number_sections: yes
---

```{r message=FALSE, warning=FALSE, include=FALSE, results='hide'}
#install.packages(c("quanteda","quanteda.textplots","stm","lda", "servr", "cld3"))
#install.packages("topicmodels")
```


```{r message=FALSE, warning=FALSE, include=FALSE, results='hide'}
# We first load the required libraries in order to be able to run the code:
library(readr)
library(tidyverse)
library(quanteda) # quantitative analysis of textual data  (https://quanteda.io/articles/quickstart.html)
library(quanteda.textplots) # complementary to quanteda, for visualization
library(cld3) # for language detection
library(lda) # implementation of Latent Dirichlet Allocation
library(servr) # will be used for visualization
library(topicmodels) # alternative to lda, several topic models included
library(stm) # for structural topic modeling
```

# Data managing
We load the speeches made by legislators in the UK House of Commons during the 2014 calendar year, which we will work with throughout this seminar. Using structural topic modeling, we aim to understand how parliamentary speech varies with the gender of the MP.
```{r}
# We load the data that will be used
PATH <- '~/Desktop/TERM 1/SMI/seminar5/data'
load(file.path(PATH, "hoc_speeches.RData"))
```

## Summary statistics

Now, we can compute some summary statistics to get a quick overview of the general patterns in our data.

```{r}
summary(speeches)
```

We can see that there is a significant gender-wise class imbalance: female speeches make up only 22% of the 7590 speeches in the dataset, while male speeches account for the remaining 78%.
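As a quick sanity check, these shares can be recomputed directly from the counts reported in the summary above (1670 female, 5920 male); a minimal base-R sketch:

```{r}
# Gender shares recomputed from the summary counts above
counts <- c(female = 1670, male = 5920)
shares <- round(prop.table(counts), 2)
shares  # female 0.22, male 0.78
```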

Then, we can check for duplicated speeches contained in the data and see that there are, in fact, some duplicated ones: 7139 unique speeches versus the original 7590 ones.
```{r}
length(unique(speeches$speech))
```

However, since they may have been delivered by different members (and therefore possibly by both men and women), we will not drop them, so that we do not lose information by further imbalancing the dataset.
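One way to verify this reasoning would be to check whether any repeated speech is indeed attributed to more than one gender. A minimal sketch on toy data (the column names mirror those in `speeches`, but the values are made up):

```{r}
# Toy data frame mimicking the structure of `speeches`
toy <- data.frame(speech = c("a", "a", "b", "c"),
                  gender = c("male", "female", "male", "female"))
# Number of distinct genders attached to each unique speech
genders_per_speech <- tapply(toy$gender, toy$speech, function(g) length(unique(g)))
names(genders_per_speech[genders_per_speech > 1])  # duplicated speeches spanning both genders
```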

Finally, we can check the number of speeches performed by each party in the UK House of Commons:

```{r}
speeches %>% 
  group_by(party) %>%
    summarise(observations = length(party))
```
where we can see that 56% were delivered by the Conservative party, 34% by the Labour party and 10% by the Liberal Democrat party, which tells us which party has more representatives in the House of Commons (and hence more decision power).

## Creating and analyzing the corpus data
Next, we will use the functions in the quanteda package to turn this data into a corpus object. We will save the speeches data through the meta parameter of the corpus() function, so we can access it later. Furthermore, we can also check the detected languages of the speeches, which, as expected, are all English.

```{r}
speeches_corpus <- corpus(speeches, text_field = 'speech', meta = speeches)
languages <- detect_language(speeches_corpus)
table(languages)
```

To study the corpus more thoroughly, we can plot the length of each document with the ntoken() function. Here, on the uncleaned original data, tokens include words, punctuation characters, numbers, etc. Let's plot the distribution and analyze it further.


```{r results='hide'}
ntokens_speeches <- ntoken(speeches_corpus)
data.frame(ntokens_speeches) %>% ggplot(aes(ntokens_speeches)) + geom_histogram() + xlab('Number of tokens')
```
First, we can see that many speeches have fewer than roughly 250 tokens (and a lot of them seem to have almost zero length). At the other extreme, a small number of speeches run into the thousands of tokens, but such very long speeches are rare. All things considered, we will therefore filter the speeches, keeping only those with between 50 and 1000 tokens.
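The filtering step below keeps only documents in that range; on toy counts it behaves as follows (a minimal sketch, with made-up document names and lengths):

```{r}
# Toy token counts, mimicking the output of ntoken()
n_tok <- c(doc1 = 10, doc2 = 120, doc3 = 2000, doc4 = 600)
kept <- names(n_tok[(n_tok >= 50) & (n_tok <= 1000)])
kept  # "doc2" "doc4"
```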

```{r}
docs_length50_1000 <- names(ntokens_speeches[(ntokens_speeches>=50) & (ntokens_speeches<=1000)])
speeches_corpus <- speeches_corpus[names(speeches_corpus) %in% docs_length50_1000]

```

We can now perform some further summary statistics on the filtered data and inspect the main features of each speech. We will show just the first 10 elements, so that the output is not unnecessarily long.

```{r}
summary(speeches_corpus, 10)
```

## Cleaning the data
Now, with the filtered data we want to work with, we will first turn this corpus into a tokens object and then into a document-feature matrix, applying all the preprocessing steps to our dataset. More particularly, we will use the following data-cleaning steps:

  - Punctuation, symbol and number removal.
  - Stopword removal: we drop short, uninformative words. To select them, we first take the most common English words (with the function stopwords("en")), and then we keep adding words to the list as we inspect the most frequent terms.
  - Stemming: we keep only the generic root shared by many specific word forms, in order to reduce dimensionality.
  - Infrequent-term removal: we get rid of the words that rarely appear.
  - n-grams: we consider each word on its own and in combination with its adjacent word; that is, n-grams with n=1 and n=2.
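To make the n-gram step concrete, here is a minimal base-R sketch of how unigrams and bigrams are formed from a toy sentence (the real pipeline uses quanteda's tokens_ngrams(), which joins adjacent tokens with an underscore):

```{r}
words <- c("national", "health", "service")
# Bigrams: each token joined with its immediate successor
bigrams <- paste(head(words, -1), tail(words, -1), sep = "_")
c(words, bigrams)  # unigrams followed by bigrams
```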

Then, we can display a small portion of the resulting cleaned data, where we can see the number of times each of the shown words appears in the first few texts:
```{r}
eng_sw <- stopwords("en")
our_stopwords <- c(eng_sw, 'hon', 'can','s','make','need','one','want','say','take','sure','however',
                   'may','much','must','now','go','use','two','get','said','made','also','look','put','see','give')
our_stopwords
```

```{r}
dfm_speeches <- tokens(speeches_corpus, remove_punct = TRUE, 
                                          remove_symbols = TRUE, remove_numbers = TRUE) %>% 
                            tokens_remove(our_stopwords) %>%
                            tokens_wordstem() %>%
                            tokens_ngrams(n = c(1,2)) %>% 
                            dfm() %>% 
                            dfm_tolower() %>% 
                            dfm_trim(min_termfreq = 5, min_docfreq = 0.0025, docfreq_type = "prop")
dfm_speeches
```


Now, we can plot the length of each document again, now with the filtered, cleaned and preprocessed data, and see the changes we have done:

```{r results='hide'}
ntokens_speeches_2 <- ntoken(dfm_speeches)
data.frame(ntokens_speeches_2) %>% ggplot(aes(ntokens_speeches_2)) + geom_histogram() + xlab('Number of tokens')
```
To know whether there are more meaningless words with a high frequency of appearance in the documents, we can show the top $n$ most common words: by brute force, we keep appending them to the stopword list and re-running the code until the set of most frequent words can be considered meaningful.

```{r}
topfeatures(dfm_speeches, 50)
```

## Word cloud
Once our data is completely cleaned, with all the considered stopwords removed, we can plot the resulting word cloud with the textplot_wordcloud() function and see what we obtain.

```{r warning=FALSE}
textplot_wordcloud(dfm_speeches, random_order = FALSE, rotation = 0.25, 
    color = RColorBrewer::brewer.pal(9, "Set2"),max_words =100,max_size = 3)
```

### Bigram word cloud

Now, we will additionally plot a word cloud just for bigrams, in order to see the most frequent word pairs. Hence, we first create the document-feature matrix for bigrams, with all the preprocessing steps applied to our dataset.

```{r}
bigrams <- tokens(speeches_corpus, remove_punct = TRUE, 
                                          remove_symbols = TRUE, remove_numbers = TRUE) %>% 
                            tokens_remove(our_stopwords) %>%
                            tokens_wordstem() %>%
                            tokens_ngrams(n = c(2)) %>% 
                            dfm() %>% 
                            dfm_tolower() %>% 
                            dfm_trim(min_termfreq = 5, min_docfreq = 0.0025, docfreq_type = "prop")
bigrams
```

Now, we can plot the bigram count for each document and compare it with the distribution over all tokens:

```{r results='hide'}
bigram_ntokens <- ntoken(bigrams)
data.frame(bigram_ntokens) %>% ggplot(aes(bigram_ntokens)) + geom_histogram() + xlab('Number of bigrams')
```
Finally, we plot the bigram word cloud, where we have reduced the maximum number of bigrams to plot, since we saw in the chart above that the x-axis (number of bigrams) reached a maximum of around 40.

```{r warning=FALSE}
textplot_wordcloud(bigrams, random_order = FALSE, rotation = 0.25, 
    color = RColorBrewer::brewer.pal(9, "Set2"),max_words =40,max_size = 3)
```

# Structural Topic Model (STM)

## Defining the STM

Next, we will fit a structural topic model on the corpus we created, using the gender variable as the topic-prevalence covariate and setting $K=20$ as the number of topics. The stm() function, which fits this topic model, returns $K$ topics, each containing the words most strongly associated with it, as we can see in the graph below:

```{r, results='hide'}
stm_speeches <- stm(documents=dfm_speeches, prevalence = ~gender, 
                    K = 20, seed=123)
```

## Plotting the results
```{r, fig.height=6, fig.width=6}
plot(x=stm_speeches, type = 'summary', n=5)
```

To further analyze each topic and the words it contains, we can show the four different word weightings that are used to assign words to each topic: highest probability (the most probable words), FREX (words that are both frequent and exclusive), lift and score (from two other popular text-mining packages).
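For reference, FREX balances frequency and exclusivity through a weighted harmonic mean of a word's rank on each dimension; as documented for the stm package, for topic $k$ and word $v$ it takes the form

$$\mathrm{FREX}_{k,v}=\left(\frac{\omega}{\mathrm{ECDF}\!\left(\beta_{k,v}\big/\textstyle\sum_{j}\beta_{j,v}\right)}+\frac{1-\omega}{\mathrm{ECDF}\!\left(\beta_{k,v}\right)}\right)^{-1},$$

where $\beta_{k,v}$ is the topic-word probability, $\mathrm{ECDF}$ denotes the empirical CDF taken across words within the topic, and $\omega$ is the weight placed on exclusivity.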

```{r}
labelTopics(stm_speeches)
```
## Top 3 documents per topic

To find the top three documents associated with each topic, we will use the plotQuote() function, which produces an image with the top 3 original documents most strongly related to each topic. We could also use the findThoughts() function (which we have kept in the code), but it displays the top 3 documents for all topics together, which makes for a very ugly display. Hence, we will use plotQuote() directly and display the results nicely, as below:

```{r fig.height=9, fig.width=9}
top3_docs <- findThoughts(model=stm_speeches,texts = speeches_corpus)
for (i in 1:20) {
  plotQuote(speeches_corpus[order(stm_speeches$theta[,i], decreasing = T)[1:3]], width = 150, 
          main = paste('topic: ', i))
}
```

As we can see, if we check each topic and its most representative words, we find that these texts correspond well with their associated topics.

## Topic comparison between men and women



Finally, we can check on which topics women are, on average, more active than men. To do so, we will plot the results provided by estimateEffect(). Among the many parameters we can set, the most important one is the plotting method; we will use "difference", since it clearly displays the comparison between men and women for each topic. Setting the first covariate value to 'male' and the second to 'female', and taking 0 as the reference point, each topic is shifted to the right (if men are more active than women on that topic) or to the left (if women are more active than men). We can also check this criterion by looking at the sign of the estimate in the summary of the estimateEffect() output: since it displays the effect for men (gendermale), the estimate is positive when the line shifts to the right and negative when it shifts to the left. Let's show the results, so we can verify them.
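In symbols, for each topic $k$ the "difference" method plots the estimated gap in expected topic proportions,

$$\hat{\Delta}_k=\mathbb{E}\!\left[\theta_{k}\mid \text{gender}=\text{male}\right]-\mathbb{E}\!\left[\theta_{k}\mid \text{gender}=\text{female}\right],$$

together with its confidence interval, so $\hat{\Delta}_k>0$ indicates a topic more prevalent among men and $\hat{\Delta}_k<0$ one more prevalent among women.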

```{r}
gender_effect <- estimateEffect(formula = ~gender, stmobj = stm_speeches,metadata = 
                                  docvars(dfm_speeches))
summary(gender_effect)
```


```{r, fig.height=6, fig.width=14}
plot.estimateEffect(x=gender_effect,covariate = 'gender', model=stm_speeches, method = 'difference',
                   cov.value1 = 'male', cov.value2 = 'female', labeltype = 'custom', custom.labels = c('1: M vs W','2: M vs W','3: M vs W','4: M vs W','5: M vs W','6: M vs W','7: M vs W','8: M vs W','9: M vs W','10: M vs W','11: M vs W','12: M vs W','13: M vs W','14: M vs W','15: M vs W','16: M vs W', '17: M vs W', '18: M vs W', '19: M vs W','20: M vs W'),
                    xlab='Proportion',ylab = 'Topic',main='Proportion of topics per gender', n=3,width = 100)
```

As we can see, for the topics on which women are more active than men, the estimates are indeed negative, while those on which men are more active than women have positive estimates.


